This paper proposes an approach to predict a 3D voxel occupancy map from 2D images, enabling open-vocabulary tasks like zero-shot segmentation and language-based retrieval. A model is designed with 2D-3D encoding, occupancy prediction, and 3D-language heads. It's trained via self-supervision using images, language features, and LiDAR, not needing manual 3D annotations. ...